[Distributed Optimizer] Fix transpose creation when keep_fp8_weight_transpose_cache=False#501
Open
Conversation
… add unit test when using distributed optimizers
ipanfilo
requested changes
Mar 21, 2026
  # Delayed scaling and per-tensor current scaling: if backend does not support
  # non-transposed FP8 GEMM, pre-create the transpose.
- if not is_non_tn_fp8_gemm_supported():
+ if model_weight._quantizer.columnwise_usage and not is_non_tn_fp8_gemm_supported():
Collaborator
Please comment or guard the changes
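The comment-and-guard the reviewer asks for could look like the sketch below; `Quantizer` and the stub `is_non_tn_fp8_gemm_supported` are simplified stand-ins for the real Transformer Engine helpers, not the actual implementation.

```python
# Sketch only: stand-ins for the real quantizer and backend check.
class Quantizer:
    def __init__(self, columnwise_usage):
        self.columnwise_usage = columnwise_usage

def is_non_tn_fp8_gemm_supported():
    # Assume a backend that only supports transposed (TN) FP8 GEMMs.
    return False

def needs_precreated_transpose(quantizer):
    # Delayed scaling and per-tensor current scaling: pre-create the
    # transpose only if the quantizer keeps columnwise data (transpose
    # cache enabled) and the backend cannot run non-transposed FP8 GEMMs.
    return quantizer.columnwise_usage and not is_non_tn_fp8_gemm_supported()

print(needs_precreated_transpose(Quantizer(columnwise_usage=True)))   # True
print(needs_precreated_transpose(Quantizer(columnwise_usage=False)))  # False
```

With `keep_fp8_weight_transpose_cache=False` the quantizer's `columnwise_usage` is cleared, so the guard skips transpose creation entirely.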
  quantizations.append("fp8_block")

  manual_post_all_gather_processings = [False, True]
+ keep_fp8_weight_transpose_caches = [True, False]
Collaborator
It should be only True on CUDA, and better to name it keep_fp8_weight_transpose_cache to match the name of the parameter
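One way to apply the suggestion is sketched below; `is_cuda_backend` is a hypothetical stand-in for whatever platform check the test suite actually uses.

```python
# Sketch of the suggested parametrization; is_cuda_backend is hypothetical.
def transpose_cache_options(is_cuda_backend):
    # Per the review: keep_fp8_weight_transpose_cache should be only True
    # on CUDA; the False path is only exercised on ROCm/AMD.
    if is_cuda_backend:
        return [True]
    return [True, False]

print(transpose_cache_options(True))   # [True]
print(transpose_cache_options(False))  # [True, False]
```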
Summary
Fixes a bug where `post_all_gather_processing` created a transpose for Float8Tensor weights even when `keep_fp8_weight_transpose_cache=False`, leading to assertion failures in Linear forward.

Problem
With `keep_fp8_weight_transpose_cache=False`, `quantizer.columnwise_usage` is set to False (e.g. on ROCm/AMD). `post_all_gather_processing` was still creating a transpose via `_create_transpose()` because it did not respect `columnwise_usage`. This triggered the assertion in Linear forward: "expected _transpose to be None or an empty tensor when transpose cache is disabled".

Solution
Guarded `post_all_gather_processing` in `utils.py` so it only creates a transpose when `model_weight._quantizer.columnwise_usage` is True. Extended `test_cast_master_weights_to_fp8` with `keep_fp8_weight_transpose_cache=True` and `False` to cover both cases.

Testing
`test_cast_master_weights_to_fp8` with `keep_fp8_weight_transpose_cache=False` now passes.
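The failure mode this test guards against can be sketched as follows; `FakeFloat8Weight` and `linear_forward_check` are simplified stand-ins for the real Float8Tensor weight and the check in Linear forward.

```python
# Simplified stand-ins; not the real Float8Tensor or Linear module.
class FakeFloat8Weight:
    def __init__(self, transpose_cache_enabled, transpose):
        self.transpose_cache_enabled = transpose_cache_enabled
        self._transpose = transpose

def linear_forward_check(weight):
    # Mirrors the assertion in Linear forward: with the transpose cache
    # disabled, all-gather must not have created a transpose.
    if not weight.transpose_cache_enabled:
        assert weight._transpose is None, (
            "expected _transpose to be None or an empty tensor "
            "when transpose cache is disabled"
        )
    return True

# Before the fix: post_all_gather_processing created a transpose anyway.
try:
    linear_forward_check(FakeFloat8Weight(False, [1.0, 2.0]))
except AssertionError as exc:
    print("assertion fired:", exc)

# After the fix: no transpose is created, so forward proceeds.
print(linear_forward_check(FakeFloat8Weight(False, None)))  # True
```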